Acronyms 


• Application specific integrated circuit (ASIC) 

• Block random access memory (BRAM) 

• Block Triple Modular Redundancy (BTMR) 

• Clock (CLK or CLKB) 

• Combinatorial logic (CL) 

• Configurable Logic Block (CLB) 

• Digital Signal Processing Block (DSP) 

• Distributed triple modular redundancy (DTMR) 

• Edge-triggered flip-flops (DFFs) 

• Equivalence Checking (EC) 

• Error detection and correction (EDAC) 

• Field programmable gate array (FPGA) 

• Finite State Machine (FSM) 

• Gate Level Netlist (EDF, EDIF, GLN) 

• Global triple modular redundancy (GTMR) 

• Hardware Description Language (HDL) 

• Input-output (I/O) 

• Linear energy transfer (LET) 

• Local triple modular redundancy (LTMR) 

• Look up table (LUT) 


• Multiple Bit Upsets (MBUs) 

• Naval Research Laboratory (NRL) 

• Operational frequency (fs) 

• Power on reset (POR) 

• Place and Route (PR) 

• Radiation Effects and Analysis Group (REAG) 

• Single error correction double error detection (SECDED) 

• Single event functional interrupt (SEFI) 

• Single event effects (SEEs) 

• Single event latch-up (SEL) 

• Single event transient (SET) 

• Single event upset (SEU) 

• Single event upset cross-section ($\sigma_{SEU}$) 

• Static random access memory (SRAM) 

• System on a chip (SOC) 
Agenda 

• Single Event Upsets (SEUs) in FPGAs and Fail-Safe 
Overview. 

• Single Event Upsets and FPGA Configuration. 

• Single Event Upsets in an FPGA’s Functional Data Path 
and Fail-Safe Strategies. 

• Fail-Safe Strategies for FPGA Critical Applications. 

• Fail-Safe State Machines. 
Definitions 

• A Field-Programmable Gate Array (FPGA) is a 
semiconductor device containing configurable logic 
components called "logic blocks" and configurable 
interconnects. Logic blocks can be configured to perform 
the function of basic logic gates such as AND and XOR, or 
more complex combinational functions such as decoders 
or mathematical functions. 

• An application-specific integrated circuit (ASIC) is a 
semiconductor device designed for a particular use. Its 
designs are considered more custom. Processors, RAM, 
ROM, etc. are examples of ASICs. 

■ From a user’s perspective, an 
FPGA is an ASIC designed to have 
a "sea" of configurable logic for 
general-purpose usage. 
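
To make the idea of a configurable logic block concrete, here is a minimal Python sketch of a 2-input lookup table (LUT). It is an illustration only: the names `make_lut`, `AND_CONFIG`, and `XOR_CONFIG` are hypothetical and do not correspond to any vendor's cell library. The same "hardware" (the function returned by `make_lut`) behaves as AND or as XOR depending solely on four configuration bits.

```python
from itertools import product


def make_lut(truth_table_bits):
    """Return a 2-input logic function defined purely by 4 configuration bits.

    truth_table_bits[i] is the output for inputs (a, b), where i = 2*a + b.
    """
    bits = tuple(truth_table_bits)
    if len(bits) != 4 or any(bit not in (0, 1) for bit in bits):
        raise ValueError("a 2-input LUT needs exactly four 0/1 configuration bits")

    def lut(a, b):
        # The inputs only select which stored configuration bit is driven out.
        return bits[2 * a + b]

    return lut


# Hypothetical configuration words: same logic block, different function.
AND_CONFIG = (0, 0, 0, 1)
XOR_CONFIG = (0, 1, 1, 0)

and_gate = make_lut(AND_CONFIG)
xor_gate = make_lut(XOR_CONFIG)

for a, b in product((0, 1), repeat=2):
    print(a, b, "AND:", and_gate(a, b), "XOR:", xor_gate(a, b))
```

Because the function is nothing more than stored bits, changing a stored bit changes the logic, which is why configuration storage matters so much later in this presentation.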
Creating A Design in An Integrated 
Circuit Device (FPGA or ASIC) 

* The objective is to create a hardware 
design using a hardware description 
language (HDL): 

- Clocks, 

- Resets, 

- Sequential elements 
(e.g., flip-flops), 

- Combinatorial logic. 





* The description gets synthesized into a hardware gate-level- 
netlist (GLN: file listing gates and connectivity). 

* The synthesized hardware gates are mapped and placed into 
the cell library (or logic blocks) of the target FPGA or ASIC. 

* ASIC flow produces a mask that is handed to a foundry. FPGA 
flow produces a configuration file. 
Design Tools 

• Design tools are used for each step of the design process. 

• Synthesis: maps HDL into logic blocks (cells) and outputs 
gate-level netlists. 

• Place and route (PR): optimizes where the logic blocks 
and their interconnects should be within the device. 

• Synthesis and PR tools contain optimization 
algorithms within their tool sets. 

- These algorithms are used to optimize area, power, and logic 
function. 

- Tools are difficult to create. Poorly designed tools can create 
designs that are functionally incorrect, too large to fit into the 
target device, or consume too much power. Hence, a bad tool can 
produce unusable designs. 

- Equivalence checking (EC) verifies that the tool output matches the HDL. 

Best practice is to use a proven vendor’s tool set; otherwise the 
product might be unreliable or unusable. 
ASIC Design Flow 

HDL: Hardware description language 
STA: Static timing analysis 
EC: Equivalence checking 

[Figure: ASIC design flow, with the verification performed at each step in parentheses.] 

• Functional specification. 

• HDL (behavioral simulation). 

• Synthesis (STA, EC, and gate-level simulation). 

• Physical design, handed off to a back-end design house: floorplanning, clock balancing, place and route, and timing closure (STA and back-annotated gate-level simulation). 

To be presented by Melanie Berg via WebEx at SERRESSA 2015 International School on the Effects of Radiation on Embedded Systems for 
Space Applications, Puebla, Mexico, November 30 to December 4, 2015. 


FPGA Design Flow 

FPGAs are created by manufacturers and are sold to 
users. The user maps a design into the FPGA fabric. 

[Figure: FPGA manufacturer flow.] 

• Manufacturer creates the FPGA design structure: logic block cells, routing structures, configuration. 

• Manufacturer sends the FPGA circuit to a foundry. 

• FPGAs are sold to users with configurable logic blocks and routes (they do not contain an operable design). 





FPGA User Design Flow 

HDL: Hardware description language 
STA: Static timing analysis 
EC: Equivalence checking 

[Figure: FPGA user design flow, with the verification performed at each step in parentheses.] 

• Functional specification. 

• HDL (behavioral simulation). 

• Synthesis (STA, EC, and gate-level simulation). 

• Place and route (STA and back-annotated gate-level simulation). 

• Create configuration. 

Looks like the ASIC design flow, but without the wait time. 

Performed by the user with manufacturer FPGA-specific design tools. 

The user creates a design that is mapped into a manufacturer-provided FPGA. 







FPGA or ASIC? 
FPGA and ASIC Devices ... System Usage 

• An FPGA (similarly to an ASIC) can be used to solve 
any problem that is computable: 

- User implements a digital (or mixed-signal) design. 

- Design can be trivial glue logic (e.g., interface control), or 

- Design can be as complex as a system on a chip that may 
include processors, embedded memory, and high-speed 
serial interfaces (Gigabit SERDES; SERDES: serializer/deserializer). 

• The number of gates contained within the original 
FPGA devices was too small to compete with the 
ASIC devices of that time (1980s). 

- FPGAs were mostly used as interface glue logic. 

- Reduced system cost and added flexibility. 

• Modern-day FPGAs contain millions of gates and have 
taken over a significant amount of the ASIC market. 
The ASIC Advantage 

ASIC Advantage | Comment/Explanation 
Full custom capability | The design is “tailored” and is manufactured to design specifications (no additional hidden logic) 
Lower unit costs | Great for very high volume projects 
Smaller form factor | Less logic is required because device is manufactured to design specs 
No configuration | Overall reliability can decrease due to the addition of configuration technology/logic 
Lower power | Less logic is required because device is manufactured to design specs 
The FPGA Advantage 

FPGA Advantage | Comment/Explanation 
Faster time-to-market | No layout, masks or other manufacturing steps are needed 
No upfront non-recurring expenses (NRE) | Costs typically associated with an ASIC design 
Simpler design cycle | Due to the required tools that handle routing, placement, and timing 
More predictable project cycle | Due to elimination of potential re-spins and lack of concern regarding wafer capacities as it would be in ASICs 
Field reprogrammability | It is easier to change a design in a system 
Engineer availability | More students are taught FPGA design in school 

FPGA: Faster design cycle and cheaper to implement 
What’s Inside FPGA Devices? 

General FPGA Architecture: Fabric Containing 
Customizable Preexisting Logic ... User Building Blocks 
FPGA Structure Categorization as 
Defined by NASA Goddard REAG: 

SEU cross section: $\sigma_{SEU}$ 

$$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$$ 

• Design $\sigma_{SEU}$ is categorized into: 

- Configuration $\sigma_{SEU}$. 

- Functional logic $\sigma_{SEU}$: sequential and combinatorial logic (CL) in the data path. 

- SEFI $\sigma_{SEU}$: global routes and hidden logic. 

• Single event functional interrupts (SEFI): SEFI is out of presentation scope. 

SEU testing is required in order to characterize the 
$\sigma_{SEU}$s for each of the FPGA categories. 
How Do FPGAs Differ? 


• Manufacturer Architecture (not all are listed): 

- Configuration, 

- User building blocks (combinatorial logic cells, sequential logic 
cells), 

- Routing, 

- Clock structures, 

- Embedded mitigation, and 

- Embedded intellectual property (IP); e.g., memories, complex I/O 
management, phase locked loops (PLLs), and processors. 

• Manufacturer design tool environment: 

- Synthesis, 

- Place and Route, and 

- Configuration management output. 

Difference in architectures and tools will affect the 
final design and design process - users be aware. 
FPGA Component Libraries: Basic 
Designer Building Blocks (They Differ 
per FPGA Type) 

• Combinatorial logic 
(CL) blocks 

- Vary in complexity. 

- Vary in I/O. 

• Sequential logic blocks 
(DFF) 

- Use global clocks. 

- Use global resets. 

- May have mitigation. 
User Maps the Design Logic into FPGA 
Preexisting Logic 

[Figure: a hardware description language (HDL) design passes through synthesis. Combinatorial logic is mapped to an FPGA equivalent block, and each flip-flop (FF) is mapped to an FPGA equivalent block.] 
FPGA Configuration (Storage of User 

• Configuration defines: 

- Arrangement of pre-existing 
logic via programmable 
switches. 

- Functionality (logic cluster) and 
connectivity (routes). 

• Programmable switch 
types: 

- Antifuse: One time 
programmable (OTP), 

- SRAM: Reprogrammable (RP), 
or 

- Flash: Reprogrammable (RP). 
[Figure: FPGA fabric showing I/O connects, the routing matrix, and programmable switches.] 
FPGAs and Critical Applications 

Common FPGA Applications 


• Controllers, 

• Dataflow and interface adaptation, 

• Digital signal processing (DSP), 

• Software-defined radio, 

• ASIC prototyping, 

• Medical imaging, 

• Robotic control (vision, movement, speech, etc.), 

• Cryptology, 

• Nuclear plant control, 

• The list goes on... 





[Figure: space application examples.] 

• Mars Rover 

• Soil Moisture Active Passive 

• Spacecube: International Space Station 
Common Applications Example 2: 
Terrestrial 

[Figure: automotive applications.] 

• Digital Cluster 

• Emissions Control 

• Advanced Suspension and Traction Control 

• Navigation and Telematics Displays 

• Personnel Occupancy Detection Systems (PODS) for Next-Generation Airbags 

• Blind-Spot Warning System 

• Engine Control Module 

• Lane Departure Warning System 

• Rear-Seat Entertainment Source MUXing 

• Back-Up Camera 

• Back-Up Sensors 

• Adaptive Cruise Control 

• Power Steering Control 

• Multi-Axis Power Seat Control 

• Collision Avoidance System 

• Injector Control (especially diesel engines) 

http://www.eetimes.com/document.asp?doc_id=1305894 





Concerns for Using FPGA Devices in 
Critical Applications 

Critical applications will want to 
avoid disaster. 

• Safety: can circuits or 
humans be damaged or hurt? 

• Reliability: will the device 
operate as expected? 

• Availability: how often will the 
system operate as expected? 

• Recoverability: if the device 
malfunctions, can the system 
come back to a working 
state? 

• Trust: will the insertion of the 
device compromise security? 
Sources of FPGA Failures 

• Packaging and mounting 

• Poor design choices 

• Negative bias temperature instability (NBTI) 

• Hot carrier injection (HCI) 

• Dielectric breakdown (DB) 

• Total ionizing dose (TID) 

• Electromigration (EM) 

• Single event effects (SEEs) 

• Environmental stress 

• Transistor switching 

• Lack of verification 
How To Protect A System from Failure 

• Investigate failure modes - understand risk: 

- Reliability testing (temperature, voltage, mechanical, and logic 
switching stresses). 

- Radiation testing: Single event effects (SEE) and total ionizing 
dose (TID). 

• Add redundancy: 

- Replication with correction. 

- Replication with detection. Requires recovery: 

• Switch to another device, 

• Try to recover state, 

• Start over, 

• Alert, 

• Do nothing... die. 

• Add filtration: e.g., Finite impulse response (FIR) filters 
or Constant false alarm rate filter (CFAR). 

• Add masking: Protect system operation from failures. 

Single Event Upsets (SEUs) and FPGA 
Devices 

• Although there are many sources of FPGA 

malfunction, this presentation will focus on SEUs as a 
source of failure. 
Implications of SEUs to FPGA 
Applications 

• Ionizing particles cause upsets (SEUs) in FPGAs. 

• Each FPGA type has different SEU error signatures: 

- Temporary glitch (transient), 

- Change of state (incorrect state machine transitions), 

- Global upsets: Loss of clock or unexpected reset, 

- Configuration corruption. This includes route breakage (no 
signal can get through) - can be overwhelming. 

• The question is how to avoid system failure and the 
answer depends on the following: 

- The system’s requirements and the definition of failure, 

- The target FPGA and its surrounding circuitry susceptibility, 

- Implemented fail-safe strategies, 

- Reliable design practices, 

- Radiation environment. 
SEE Go-no-Go: Single Event Hard 
Faults and Common Terminology 

• Single Event Latch-Up (SEL): Device latches in a high 
current state: 

- Has been observed in FPGA devices that are currently on the 
market. 

- Some missions choose to use the devices and design around 
the SEL. 

• Single Event Burnout (SEB): Device draws high 
current and burns out. 

- Not observed in FPGA devices that are currently on the 
market. 

• Single Event Gate Rupture (SEGR): Gate destroyed, 
typically in power MOSFETs. 

- Not observed in FPGA devices that are currently on the 
market. 
Fail-Safe Strategies for FPGA 
Critical Applications 

Goal for critical applications: 
Limit the probability of system error propagation and/or 
provide detection-recovery mechanisms via fail-safe strategies. 




Mitigation 

• Error masking vs. error correction: there’s a 
difference. 

• Mitigation can be: 

- User inserted: part of the actual design process. 

* User must verify mitigation. Complexity is a RISK! 

- Embedded: built into the device library cells. 

* User does not verify the mitigation; the manufacturer does. 

• Mitigation should reduce error: 

- Generally through redundancy. 

- Incorrect implementation can increase error. 

- Overly complex mitigation cannot be verified and 
incurs too high a risk to implement. 
Differentiating Fail-Safe Strategies: 

• Detection: 

- Watchdog (state or logic monitoring). 

- Simplistic Checking ... Complex Decoding. 

- Action (correction or recovery). 

• Masking (does not mean correction): 

- Not letting an error propagate to other logic. 

- Redundancy + mitigation or detection. 

- Turn off faulty path. 

• Correction (error may not be masked): 

- Error state (memory) is changed/fixed. 

- Need feedback or new data flush cycle. 

• Recovery: 

- Bring system to a deterministic state. 

- Might include correction. 
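
The difference between masking and correction can be sketched with a triplicated register and a majority voter. This is a toy Python model under stated assumptions, not a description of any particular device: the class `TripledRegister` and its methods are hypothetical names introduced for illustration. Voting masks a single bad copy at the output, but the bad copy stays in storage until the voted value is fed back (correction).

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


class TripledRegister:
    """Three redundant copies of one register value (hypothetical model)."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def upset(self, copy_index, bit):
        """Model an SEU: flip one bit in one stored copy."""
        self.copies[copy_index] ^= 1 << bit

    def read(self):
        """Masking: the voted output hides a single bad copy."""
        return majority(*self.copies)

    def correct(self):
        """Correction: feed the voted value back into all three copies."""
        voted = self.read()
        self.copies = [voted, voted, voted]


# Masking only: the first upset is hidden, but it is still stored.
reg = TripledRegister(0b1010)
reg.upset(copy_index=0, bit=0)
masked_output = reg.read()          # output is still the original value
reg.upset(copy_index=1, bit=0)      # second upset accumulates on the same bit
broken_output = reg.read()          # the voter is now outvoted

# Masking plus correction between upsets: errors do not accumulate.
reg2 = TripledRegister(0b1010)
reg2.upset(copy_index=0, bit=0)
reg2.correct()
reg2.upset(copy_index=1, bit=0)
corrected_output = reg2.read()

print(bin(masked_output), bin(broken_output), bin(corrected_output))
```

In this sketch the masked-only register fails on the second upset, while the register that is corrected between upsets keeps producing the original value.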
Availability versus Correct Operation 

• Requirements must be satisfied. 

• What is your expected up-time versus down-time 
(availability)? 

• Is correct operation well defined? Unambiguous! 

• Is system failure well defined? Unambiguous! 

• Can availability and correct operation be deterministic 
regardless of error signature? 

• Availability: 

- Flushable designs: systems that can be reset or are self-correcting. 
Availability is affected during reset or correction 
time (down-time). However, downtime is tolerable as defined 
by system requirements. 

- Non-flushable designs: system requirements are strict and 
require minimal downtime. Usage of resets is required to be 
kept at a minimum. 
Detection and Recovery 

• Not all mitigation schemes require detection. 

• Questions/Considerations: 

- If your scheme requires detection: 

* Can the system detect all error signatures? 

* Can the system detect all error signatures fast 
enough? 

* Do different errors require different recovery 
schemes? Can the system accommodate them? 

- How are you going to verify the detection and 
recovery? 

- How much downtime will there be during recovery? 

Availability = detection + recovery time - masked error time 

"Yes" or "We know it will work" are not good enough 
answers: Ask how and if the scheme has been verified! 
SEUs and FPGA Variations 


• FPGA susceptibilities (error signatures) vary per 
FPGA type. 

• How does a project manage and protect against 
FPGA susceptibilities? (mitigation schemes will 
change based on FPGA type). 

• The most efficient solution will be based on 
understanding: 

- SEE theory, 

- FPGA SEE susceptibility (per FPGA type), 

- Proven mitigation strategies per FPGA type, 

- Validation and verification of implemented mitigation 
strategies, and 

- Limitations of tools and/or mitigation schemes. 
Redundancy Is Not Enough 

• Just adding redundancy to a system is not 
enough to assume that the system is well 
protected. 

• Concerns that must be addressed for a critical 
system expecting redundancy to cure all (or 
most): 

- How is the redundancy implemented? 

- What portions of your system are protected? Does the 
protection comply with the results from radiation 
testing? 

- Is detection of malfunction required to switch to a 
redundant system or to recover? 

- If detection is necessary, how quickly can the detection 
be performed and responded to? 

- Is detection enough?... Does the system require 
correction? 
Radiation Hardened (per SEU) versus 
Commercial FPGA Devices 

• A radiation hardened (per SEU) FPGA is a device that 
has embedded mitigation. 

• Radiation hardened FPGA devices are available to 
users. They make the design cycle much easier! 

• They are considered hardened if: 

- Configuration susceptibility is reduced to an 
acceptable rate. 

- Generally, less than one node per 1×10^8 days. 

• Be careful: with millions of nodes, this can translate 
into one or two configuration failures per year. 

• However, if the node isn’t being used, then your 
circuit may not be affected by the failure. 
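
The "be careful" point above is simple arithmetic, sketched below in Python. The sketch assumes the bound is read as at most one upset per node every 1×10^8 days, and the node counts are illustrative values rather than figures for any specific device. The function name `expected_config_upsets_per_year` is hypothetical.

```python
def expected_config_upsets_per_year(nodes, days_per_upset_per_node=1e8):
    """Upper-bound estimate of configuration upsets per year.

    Assumes each node upsets independently at a rate of
    1 / days_per_upset_per_node upsets per day (an assumed reading of the
    hardening threshold quoted on the slide).
    """
    per_node_rate_per_day = 1.0 / days_per_upset_per_node
    return nodes * per_node_rate_per_day * 365.0


# Illustrative node counts only.
for nodes in (500_000, 1_000_000, 5_000_000):
    rate = expected_config_upsets_per_year(nodes)
    print(f"{nodes:>9} nodes -> about {rate:.2f} upsets/year at the bound")
```

Under these assumptions a device with a million nodes sits at a few upsets per year at the bound, which is the same order of magnitude as the slide's "one or two per year". Only upsets in nodes that the design actually uses can affect the circuit.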
Radiation Hardened versus Commercial FPGA 
Device Geometries And Gate Count 

As Geometries Get Smaller, More Gates Are Available for Mitigation 

[Figure, courtesy of Synopsys: logic capacity (millions) by process geometry (150nm, 130nm, 90nm, 65nm, 28nm, 20nm, 16nm) for Virtex UltraScale+, Kintex UltraScale+, Virtex UltraScale, Kintex UltraScale, Virtex-7, Virtex-7Q, Stratix 5, Virtex 5, Virtex 5QV, Virtex 4QV and Virtex 4, RT-ProASIC, and RTAX-S.] 
FPGA Devices Listed by Configuration 
Type (Not All Are Included in The List): 
Embedded Mitigation 

Manufacturer | Configuration Type | Short List of Device Families | Embedded Mitigation 
Altera | SRAM | Stratix | No 
Microsemi | Antifuse | RTAX, RTSXS | Clocks + DFFs (configuration is already hardened by nature) 
Microsemi | Flash | ProASIC3 | Configuration is already hardened by nature. 
Xilinx | SRAM | Virtex, Kintex | No 
Xilinx | Hardened SRAM | Virtex V5QV | Configuration + DICE DFFs + SET filters 

Go to http://radhome.gsfc.nasa.gov, manufacturer websites, and other space 
agency sites for more information on SEU data and total ionizing dose data. 
FPGA Devices Listed by Configuration Type 
(Not All Are Included in The List): Susceptibility 

Configuration Type | Short List of Device Families | Embedded Mitigation | Most Susceptible Components 
SRAM | Stratix, Virtex, Kintex | No | Configuration 
Antifuse | RTAX, RTSXS | DFFs and clocks (configuration is already hardened by nature) | Combinatorial logic (however susceptibility considered low) 
Flash | ProASIC3 | Configuration is already hardened by nature. | DFFs and clocks 
Hardened SRAM | Virtex V5QV | Configuration + DICE DFFs + SET filters | Clocks. In some cases additional mitigation may be necessary for configuration and DFFs 

Go to http://radhome.gsfc.nasa.gov, manufacturer 
websites, and other space agency sites for more 
information on SEU data and total ionizing dose data. 
NASA and other Government Agency 
FPGA Device Selection for Critical 
Applications 

• Currently, the most common FPGA devices used 
for NASA driven critical space applications are 
anti-fuse. 

• This is also true for other government agencies. 

• However, due to cost of implementation and 
robustness of design, SRAM-based FPGAs are 
becoming more popular. 

• The usage of SRAM-based FPGA devices 
introduces a variety of challenges for critical 
operations because of their SEU susceptibility and 
reduced security. 
Preliminary Design Considerations for 
Mitigation And Trade Space 

Determine Most Susceptible Components: 

$$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$$ 

• Does the designer need to add 
mitigation? 

• Will there be compromises? 

- Performance and speed, 

- Power, 

- Schedule, 

- Mitigating the susceptible 
components? 

- Reliability (working and mitigating 
as expected)? 

Impact to speed, power, area, reliability, and 
schedule are important questions to ask. 
Fail-safe Strategies for Single Event 
Upsets (SEUs) 

• The following slides will demonstrate commonly used 
mitigation strategies for FPGA devices. 

• What you should learn: 

- The differences between FPGA mitigation 
strategies. 

- Strengths and weaknesses of various strategies. 

- Questions to ask or considerations to make when 
evaluating mitigation schemes. 

- Which mitigation schemes are best for various 
types of FPGA devices. 

• The scope of this presentation will cover fail-safe 
strategies for configuration and data-path SEUs 
Single Event Upsets and FPGA 
Configuration 

$$P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$$ 
Programmable Switch Implementation and 
SEU Susceptibility 

[Figure: programmable switch implementations. ANTIFUSE (one time programmable): antifuse links between metal layers connect logic; configuration is SEU immune. SRAM (reprogrammable): a programming bit, written or read through a data line, controls the switch.] 







Configuration SEU Test Results and 
the REAG FPGA SEU Model 

$$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$$ 

FPGA Configuration Type | REAG Model 
Antifuse | $P(f_s)_{error} \propto P(f_s)_{functionalLogic} + P_{SEFI}$ 
SRAM (non-mitigated) | $P(f_s)_{error} \propto P_{Configuration}$ 
Flash | $P(f_s)_{error} \propto P(f_s)_{functionalLogic} + P_{SEFI}$ 
Hardened SRAM | $P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$ 
What Does The Last Slide Mean? 

Susceptibility categories: Data-path: Combinatorial Logic (CL) and Flip-flops (DFFs); 
Global: Clocks and Resets; 
Configuration. 

FPGA Configuration Type | Susceptibility 
Antifuse | Configuration has been designated as hard regarding SEEs. Susceptibilities only exist in the data paths and global routes. However, global routes are hardened and have a low SEU susceptibility. 
SRAM (non-mitigated) | Configuration has been designated as the most susceptible portion of circuitry. All other upsets (except for global routes) are too statistically insignificant to take into account. E.g., it is a waste of time to study data path transients; however, clock transient studies are significant. 
Flash | Configuration has been designated as hard (but NOT immune) regarding SEEs. Susceptibilities also exist in the data paths and global routes (e.g., clocks and resets). 
Hardened SRAM | Configuration has been designated as hardened (but NOT hard) regarding SEEs. Susceptibilities also exist in the data paths and global routes (e.g., clocks and resets). 
Example: Routing Configuration 
Upsets in a Xilinx Virtex FPGA 

Because multiple paths can pass through the routing matrix, an upset in this 
configuration can be catastrophic, i.e., break simple mitigation. 
Traditional SRAM ... One Data Word at 
a Time 

• Upsets have no effect until the 
address containing the upset is 
read out of SRAM. 

• Error detection and 
correction (EDAC) circuits are placed 
after data out. 

• EDAC circuits only work one 
data word at a time. 
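
As a rough illustration of word-at-a-time EDAC, the Python sketch below implements a small extended Hamming code (4 data bits, 8 code bits) with single error correction and double error detection (SECDED). The word size, bit layout, and function names (`encode`, `decode`, `flip`) are assumptions chosen for brevity and are not taken from any device. It also shows the limit of SECDED: three flipped bits in one word are "corrected" to the wrong value.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) word.

    Positions 1..7 form a Hamming(7,4) code with parity at 1, 2, and 4.
    Position 0 holds overall parity across the other seven bits.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'double_error'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    overall_parity = sum(code) % 2
    if syndrome == 0 and overall_parity == 0:
        status = "ok"
    elif overall_parity == 1:
        # Odd number of flips: the decoder assumes exactly one and fixes it.
        # A syndrome of 0 here points at the overall parity bit (position 0).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detect only.
        return None, "double_error"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


def flip(code, positions):
    """Return a copy of the word with the given bit positions upset."""
    upset = list(code)
    for position in positions:
        upset[position] ^= 1
    return upset


word = encode(0b1010)
print(decode(word))                     # clean read
print(decode(flip(word, [5])))          # single upset: corrected on read-out
print(decode(flip(word, [3, 6])))       # double upset: detected, not corrected
print(decode(flip(word, [3, 5, 6])))    # triple upset: silently mis-corrected
```

Note that nothing happens to a stored upset until the word is read and decoded, which matches the slide: the EDAC sits after data out and sees one word at a time.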
Configuration SRAM is NOT Utilized the 
Same Way as Traditional SRAM 

Every used bit is visible. 

• Direct connections 
from configuration to 
user logic. 

• If an upset occurs in a used 
configuration bit, then an 
upset occurs in logic. 

• We’re not dealing with data words anymore. Traditional SRAM 
EDAC schemes don’t quite apply for configuration SRAM. 
Scrubbers: Blind versus Read-back 

Blind Scrubber 

• Write golden configuration 
into configuration. 

• Scrub cycle on the order of ms. 

• Pros: simple, less area and 
power, no need for 
additional non-volatile 
memory, very fast (great for 
accelerated testing). 

• Cons: Write pointer can get 
hit during writing and write 
bad data into configuration; 
however, insignificant 
probability of occurrence 
(proven in heavy ion SEU 
testing). 

Read-back 

• Read configuration, calculate 
correct data; if there is an 
upset, write corrected data. 

• Scrub cycle on the order of s. 

• Pros: only writes if there is an 
upset. 

• Cons: additional non-volatile 
memory required; slow (only 
a problem for accelerated 
testing); takes more area and 
power; correction scheme 
can break (e.g., be limited to 
detecting and correcting one 
upset). Consequently, upon 
an MBU it can write bad data to 
configuration; this has been 
proven via heavy-ion testing. 
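
The two scrubbing styles can be contrasted with a toy Python model. Everything here is an assumption made for illustration: configuration memory is a list of byte-string "frames", the read-back scrubber detects upsets with a stored per-frame CRC (via the standard-library `zlib.crc32`) and repairs from the golden copy, and the names `blind_scrub`, `readback_scrub`, and `upset` are hypothetical. Real read-back scrubbers often compute the correction from embedded codes instead, which is where the MBU weakness noted above comes from.

```python
import zlib


def blind_scrub(config, golden):
    """Overwrite every frame with the golden copy; return frames written."""
    for index, frame in enumerate(golden):
        config[index] = frame
    return len(golden)


def readback_scrub(config, golden, golden_crcs):
    """Read each frame and rewrite only frames whose CRC mismatches.

    Needs extra stored data (the per-frame CRCs) in addition to the golden
    image. Returns the number of frames written.
    """
    written = 0
    for index, frame in enumerate(config):
        if zlib.crc32(frame) != golden_crcs[index]:
            config[index] = golden[index]
            written += 1
    return written


def upset(config, frame_index, byte_index, bit):
    """Model a configuration SEU by flipping one bit in one frame."""
    frame = bytearray(config[frame_index])
    frame[byte_index] ^= 1 << bit
    config[frame_index] = bytes(frame)


# Hypothetical 8-frame configuration image.
golden = [bytes([i] * 4) for i in range(8)]
golden_crcs = [zlib.crc32(frame) for frame in golden]

config_a = list(golden)
config_b = list(golden)
upset(config_a, frame_index=2, byte_index=0, bit=3)
upset(config_b, frame_index=2, byte_index=0, bit=3)

blind_writes = blind_scrub(config_a, golden)
readback_writes = readback_scrub(config_b, golden, golden_crcs)
print("blind scrub wrote", blind_writes, "frames")
print("read-back scrub wrote", readback_writes, "frame(s)")
```

Both approaches restore the image in this sketch. The blind scrubber does so with less logic but writes every frame on every pass, while the read-back scrubber writes only the upset frame at the cost of a read, a check, and extra stored data.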
Scrubbers: Internal versus External (1) 

• Internal and external scrubbers are used to fix 
configuration bits: 

- An internal scrubber is created out of hard cores that reside inside 
the FPGA device, or is created out of user fabric logic blocks 
located inside the FPGA device. 

- An external scrubber is implemented in a separate device. 

• External scrubbers are usually implemented in anti-fuse 
FPGA devices. 

• Internal scrubbers are more susceptible than 
external scrubbers. 
Scrubbers: Internal versus External (2) 


• Internal scrubbers are usually implemented as read- 
back. Remember read-back scrubbers can break 
and write bad frames into the configuration due to 
MBUs. 

• Although configuration memory interleaving has 
been used in the newer SRAM-based FPGAs, a 
significant number of Multiple Bit Upsets (MBUs) 
have been observed via Naval Research Laboratory 
(NRL) laser testing. 

- Could be because of laser spot size. 

- More testing is expected to be performed early 2016. 
Differentiate Scrubbing for Space Applications 
and Scrubbing for Radiation Testing 

Space Application 

* Only scrub if there is 
mitigation. 

* Make the scrubber simple to 
reduce project risk. 

* Do not scrub constantly: it is not 
necessary and not good for 
the system. 

* Single error correction double 
error detection (SECDED) 
scrubbers may not work well 
due to multiple bit upsets 
(MBUs). 

* Blind scrubbing is the 
simplest scheme, yet read-back 
will also work. 

Accelerated SEU Testing 

* We must scrub! 

* Particles cannot overtake the 
scrubber, i.e., the scrubber must 
be fast enough to stop fast 
accumulation of configuration 
SEUs: SCRUB CONSTANTLY. 

* SECDED scrubbing schemes 
do not work well during 
accelerated testing because 
of MBUs and accumulation. 

* Generally no time for read-back 
of configuration; hence the 
blind scrubber is the best fit 
for accelerated testing. 



Warning! 


• Fixing a configuration bit does not mean that you 
have fixed the state in the functional logic path. 

• In order to guarantee that the functional logic is 
in the expected state after the configuration bit is 
fixed, either the state must be restored or a reset 
must be issued. 


Reliably getting to an expected state after a 
configuration-bit SEU (that affects the design’s 
functionality) requires one of the following: 

- Fix configuration bit + (reset or correct DFFs) or 

- Full reconfiguration. 
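
The warning above can be seen in a toy Python model of a 4-bit counter. The class `ToyDesign` is hypothetical: its "configuration" is the increment used by the next-state logic, and its "state" is the flip-flop contents. A configuration upset corrupts the state while it is present. Scrubbing repairs the configuration, but the state stays wrong until a reset (or state restore) is issued.

```python
class ToyDesign:
    """Hypothetical 4-bit counter whose increment is set by configuration."""

    GOLDEN_STEP = 1

    def __init__(self):
        self.config_step = self.GOLDEN_STEP  # configuration (next-state logic)
        self.state = 0                       # flip-flop (DFF) contents

    def clock(self):
        self.state = (self.state + self.config_step) % 16

    def scrub(self):
        """Fix the configuration bit(s) only; the DFFs are untouched."""
        self.config_step = self.GOLDEN_STEP

    def reset(self):
        """Bring the DFFs back to a known state."""
        self.state = 0


def run(reset_after_scrub):
    """Return (actual state, expected state) after an upset and a scrub."""
    design = ToyDesign()
    expected = 0

    def tick(cycles):
        nonlocal expected
        for _ in range(cycles):
            design.clock()
            expected = (expected + ToyDesign.GOLDEN_STEP) % 16

    tick(3)
    design.config_step = 3      # configuration SEU changes the logic function
    tick(2)                     # state diverges while the upset is present
    design.scrub()              # configuration is repaired
    if reset_after_scrub:
        design.reset()          # state is repaired as well
        expected = 0
    tick(4)
    return design.state, expected


print("scrub only:      ", run(reset_after_scrub=False))
print("scrub plus reset:", run(reset_after_scrub=True))
```

With scrubbing alone the counter keeps counting correctly from the wrong value, so actual and expected states never re-converge in this sketch. With scrub plus reset they match again.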
Example: Routing Configuration 
Upsets in a Xilinx Virtex FPGA 



Configuration + design state must be corrected after a configuration SEU hit. 
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Single Event Upsets in an FPGA’s Functional 
Data Path and Fail-Safe Strategies 

$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$ 
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Data-path SEUs and Their Effect at the System Level 





• A system implemented in an FPGA is a 
cascade of sequential and combinatorial 
logic. 

• The occurrence of an SET or SEU does not 
necessarily cause a system error. 

• Probability of a system error due to an SEU 
depends on many factors: 

- Probability of fault generation in a gate (SET or 
SEU). 

- Probability of error propagation - will the SET or 
SEU force the system’s next state to be incorrect? 




Probability of Error Propagation in a Data-Path 

Upsets usually occur between clock cycles. They can 
cause a system-level malfunction if the SET or SEU 
forces the system’s next state to be incorrect. 

• Capacitive filtration: data-path capacitance can stop 
transient upset propagation; e.g.: 

- Routing metal or heavy loading. 

- If a transient doesn’t reach a sequential element, then it most 
likely will not cause a system upset. 

• Logic masking: 

- Redundancy and mitigation of paths can stop upset propagation. 

- Turned off paths from gated logic can stop upset propagation. 

• Temporal delay: path delays can block temporary SEUs 
from disturbing next state calculation. 
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Data-path SEU Susceptibility and 
Analysis : the NASA Electronics Parts 
and Packaging (NEPP) FPGA Model 



Berg M., “FPGA SEE Test Guidelines”, NASA Radiation Effects and Analysis Group Website: https://nepp.nasa.gov/files/23779/FPGA_Radiation_Test_Guidelines_2012.pdf, July 2012. 
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Background: Synchronous Design Data Path - Sample and Hold 



Frequency 


• CL computes between clock edges. 

Designs are complex - We modularize for simplicity 
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Background: Synchronous Data Paths: StartPoint DFFs → EndPoint DFFs 

[Figure: StartPoint DFFs → CL (delay $T_{dly}$) → EndPoint DFF, sampled at clock edges T and T+1.] 

Every DFF has a function that determines its state: 

$EndPoint(T) = f(StartPoints(T-1), CL)$ 

Example function: (A XOR B) AND (C XOR D) 

* Datapath defined as StartPoint via CL to 
EndPoint. 

* CL and routes create delay ($T_{dly}$) from 
StartPoints to EndPoints. 

* Every data path has a unique $T_{dly}$. 

* $T_{dly}$ is calculated using Static Timing 
Analysis (STA) design tools. 

Modularization: Every DFF has a unique cone of logic 
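
As a small sketch of the sample-and-hold relationship above, the EndPoint state at clock edge T can be computed from the StartPoint values held at T-1. The function names `cone_of_logic` and `clock_edge` are illustrative and not from the slides; the cone is the slide's example function.

```python
def cone_of_logic(a, b, c, d):
    """The slide's example combinatorial function: (A XOR B) AND (C XOR D)."""
    return (a ^ b) & (c ^ d)


def clock_edge(startpoints):
    """EndPoint(T) = f(StartPoints(T-1), CL): sample the cone at a clock edge."""
    return cone_of_logic(*startpoints)


expected = clock_edge((1, 0, 0, 1))      # fault-free StartPoints
upset = clock_edge((0, 0, 0, 1))         # StartPoint A flipped by an SEU

# Logic masking: when C XOR D is 0, a flip of A cannot reach the EndPoint.
masked_before = clock_edge((1, 0, 1, 1))
masked_after = clock_edge((0, 0, 1, 1))
```

The last pair shows why not every StartPoint upset becomes a system error: the cone itself can mask the flip.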
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DFF k Cone of Logic 


How Can a DFF Contain an Incorrect State from an SEU? 

DFFs have various modes of reaching a bad state due to SEUs. 

Attribute some modes to EndPoints and some to StartPoints. 

Wrong function = DFF State 

We make a clear distinction between DFF SEUs based on clock state and capture. 

EndPoint DFF SEUs + StartPoint DFF SEUs + CL SETs: 

- EndPoint DFF SEUs: DFF upsets that occur at the clock edge. 

- StartPoint DFF SEUs: DFF upsets that occur between clock edges and are captured by EndPoints. 

- CL SETs: Single Event Transients captured by EndPoints. 
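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How Does a StartPoint SEU Get Captured by an EndPoint? 

[Figure: timing diagram of a StartPoint upset at time $\tau$ propagating through (A XOR B) AND (C XOR D) to an EndPoint; $T_{dly}$ = 9.5 ns.] 

Capture condition: $\tau + T_{dly} < T_{clk}$ 

Time slack $= T_{clk} - T_{dly}$ 

If DFF_D flips its state at time $= \tau$, it is captured when: 

$0 < \tau < T_{clk} - T_{dly}$ 

Probability of capture: 

$(1 - T_{dly}/T_{clk}) = 1 - T_{dly} f_s$ 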
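
A worked number for the capture probability, using the slide's path delay of 9.5 ns. The clock frequencies below are assumptions chosen for illustration, and `capture_probability` is a hypothetical helper name.

```python
def capture_probability(t_dly_s, f_s_hz):
    """Probability that a StartPoint upset is captured: 1 - T_dly * f_s.

    Assumes the path meets timing (T_dly <= T_clk) and that the upset time
    is uniformly distributed over the clock period.
    """
    t_clk = 1.0 / f_s_hz
    if t_dly_s > t_clk:
        raise ValueError("path delay exceeds the clock period")
    return 1.0 - t_dly_s * f_s_hz


T_DLY = 9.5e-9  # seconds, from the slide

p_50mhz = capture_probability(T_DLY, 50e6)    # T_clk = 20 ns
p_100mhz = capture_probability(T_DLY, 100e6)  # T_clk = 10 ns
```

With these assumed clocks the result is about 0.525 at 50 MHz and about 0.05 at 100 MHz: raising the frequency shrinks the time slack and lowers the chance that a StartPoint upset is captured.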
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Details of Capturing StartPoint DFFs 

$\forall DFF: \sum_{j=1}^{\#StartPointDFFs} \beta P(f_s)_{DFFSEU(j)} \, (1 - T_{dly(j)} f_s) \, P_{logic(j)}$ 

- $\beta P(f_s)_{DFFSEU(j)}$: upset generated internally to DFF between clock edges 

- $(1 - T_{dly(j)} f_s)$: design topology and temporal masking 

- $P_{logic(j)}$: topology and logic masking 

SEU generation occurs in a StartPoint between rising clock 
edges ($\beta P(f_s)_{DFFSEU}$) 

StartPoint upsets can be logically masked by logic 
between the StartPoint and its EndPoint 

Design topology and temporal effects: 

- Increase path delay (# of gates) - decrease probability of capture 

- Increase frequency - decrease probability of capture 
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Synchronous System: CL SET Capture 

Start Point 
DFFs 











Details of CL SET Capture 

$\forall DFF: \sum_{i=1}^{\#CombinatorialCells} P_{gen(i)} \, P_{prop(i)} \, P_{logic(i)} \, \tau_{width(i)} f_s$ 

- $P_{gen(i)}$: generation 

- $P_{prop(i)}$: propagation: electrical masking from routes and gate cut-off frequencies 

- $P_{logic(i)}$: logic masking 

- $\tau_{width(i)} f_s = \tau_{width(i)} / T_{clk}$: width of SET relative to clock period 

• SET generation ($P_{gen}$) occurs between clock edges 

• EndPoint DFF captures the SET at a clock edge 

- Increase frequency - increase probability of capture 

- Increase CL - increase probability of capture 
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NEPP FPGA Model: Putting it All Together - 
Analyzed Per Particle Linear Energy Transfer (LET) 

$\sum_{k=1}^{\#EndPointDFFs} P_{logic(k)} \Big( \alpha P(f_s)_{DFFSEU(k)} + \sum_{j=1}^{\#StartPointDFFs} \beta P(f_s)_{DFFSEU(j)} \, (1 - T_{dly(j)} f_s) \, P_{logic(j)} + \sum_{i=1}^{\#CL} P_{gen(i)} \, P_{prop(i)} \, P_{logic(i)} \, \tau_{width(i)} f_s \Big)$ 

- EndPoint term: $\alpha P(f_s)_{DFFSEU(k)}$, with EndPoint logic masking $P_{logic(k)}$ 

- StartPoints term: sum over $j$ 

- CL term: sum over $i$ 

StartPoints and CL need to be captured by an EndPoint... 
hence data path derating factors exist. 

Component Contribution to $\sigma_{SEU}$ across Frequency and Gate Count 

Component    Frequency                # of Gates in Path 
EndPoint     Directly Proportional    N/A 
StartPoint   Inversely Proportional   Inversely Proportional 
CL           Directly Proportional    Directly Proportional 
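
The frequency trends in the table can be checked numerically with a small sketch of the StartPoint and CL terms for a single EndPoint. All probabilities, delays, and SET widths below are made-up placeholder values, and the per-DFF upset probability is held constant across frequency purely for illustration; the function names are hypothetical.

```python
def startpoint_term(beta, f_s, startpoints):
    """Sum of beta * P_DFFSEU(j) * (1 - T_dly(j) * f_s) * P_logic(j)."""
    return sum(beta * p_dffseu * (1.0 - t_dly * f_s) * p_logic
               for p_dffseu, t_dly, p_logic in startpoints)


def cl_term(f_s, cells):
    """Sum of P_gen(i) * P_prop(i) * P_logic(i) * tau_width(i) * f_s."""
    return sum(p_gen * p_prop * p_logic * tau_width * f_s
               for p_gen, p_prop, p_logic, tau_width in cells)


# Placeholder inputs: (P_DFFSEU, T_dly in s, P_logic) per StartPoint DFF.
startpoints = [(1e-7, 4e-9, 0.5), (1e-7, 6e-9, 0.5)]
# Placeholder inputs: (P_gen, P_prop, P_logic, tau_width in s) per CL cell.
cells = [(1e-8, 0.8, 0.5, 0.5e-9)] * 10

sp_slow = startpoint_term(1.0, 50e6, startpoints)
sp_fast = startpoint_term(1.0, 100e6, startpoints)
cl_slow = cl_term(50e6, cells)
cl_fast = cl_term(100e6, cells)
cl_more_gates = cl_term(50e6, cells * 2)
```

Under these assumptions the StartPoint contribution falls as frequency rises, while the CL contribution grows with both frequency and the number of combinatorial cells, matching the table.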
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Current Use of NEPP FPGA Model 

$\sum_{k=1}^{\#EndPointDFFs} P_{logic(k)} \Big( \alpha P(f_s)_{DFFSEU(k)} + \sum_{j=1}^{\#StartPointDFFs} \beta P(f_s)_{DFFSEU(j)} \, (1 - T_{dly(j)} f_s) \, P_{logic(j)} + \sum_{i=1}^{\#CL} P_{gen(i)} \, P_{prop(i)} \, P_{logic(i)} \, \tau_{width(i)} f_s \Big)$ 

Currently, model is used to better understand heavy-ion SEU data: 

Great for measuring mitigation scheme strength. 
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Fail-Safe Strategies 
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Dual Redundant Systems 
(Detection Systems) 
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Dual Redundancy Example 


Synchronization is not always easy or predictable 

[Figure: two copies of a Complex System feed a Compare block, which drives Alert and Recover.] 


• Dual redundant systems cannot correct; they can only 
detect. 


• Roll-back + dual redundancy is not a sufficient solution 
for systems with highly susceptible hardware. 

• Alert systems must be highly reliable and verifiable. 
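
A minimal sketch of the detect-only behavior described above. The two system copies are reduced to their output words, and `compare` is a hypothetical stand-in for the Compare block.

```python
def compare(out_a, out_b):
    """Raise the alert (return True) when the two copies disagree."""
    return out_a != out_b


golden = 0b1011
copy_a = golden
copy_b = golden ^ 0b0100      # an SEU flips one bit in copy B

alert = compare(copy_a, copy_b)

# The comparator only sees a mismatch. Swapping the inputs gives the same
# answer, so it cannot say which copy holds the correct value.
alert_swapped = compare(copy_b, copy_a)

# If both copies are upset identically, nothing is detected at all.
undetected = compare(golden ^ 0b0100, golden ^ 0b0100)
```

With only two copies there is no majority, so recovery has to come from elsewhere (roll-back, reset, or a third copy).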
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Mitigation - Fail Safe Strategies That Do Not Require Fault Detection but Provide SEU Masking and/or Correction: 

Triple Modular Redundancy (TMR) 
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TMR Schemes Use Majority Voting 


$MajorityVoter = I1 \wedge I2 + I0 \wedge I2 + I0 \wedge I1$ 

I0   I1   I2   Majority Voter 
0    0    0    0 
0    0    1    0 
0    1    0    0 
0    1    1    1 
1    0    0    0 
1    0    1    1 
1    1    0    1 
1    1    1    1 
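
The voter equation translates directly into bitwise operations. The sketch below is illustrative (the name `majority` is not from the slides); it works on single bits and, unchanged, on multi-bit words.

```python
def majority(i0, i1, i2):
    """Majority vote: (I1 AND I2) OR (I0 AND I2) OR (I0 AND I1)."""
    return (i1 & i2) | (i0 & i2) | (i0 & i1)


# Reproduce the truth table, rows ordered 000 through 111.
truth_table = [majority(i0, i1, i2)
               for i0 in (0, 1) for i1 in (0, 1) for i2 in (0, 1)]

# Word-wide voting: one copy has a flipped bit and is outvoted.
word = 0b10110010
voted = majority(word, word ^ 0b00001000, word)
```

Voting masks a single bad copy per bit position; two copies upset in the same bit position would win the vote with the wrong value.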
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Triplicate and Vote 


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TMR Implementation 

• As previously illustrated, TMR can be implemented in a 
variety of ways. 

• The definition of TMR depends on what portion of the 
circuit is triplicated and where the voters are placed. 

• The strongest TMR implementation will triplicate all 
data-paths and contain separate voters for each data-path. 

- However, this can be costly: area, power, and 
complexity. 

- Hence a trade is performed to determine the TMR 
scheme that requires the least amount of effort and 
circuitry that will meet project requirements. 

• Presentation scope: Block TMR (BTMR), Localized TMR 
(LTMR), Distributed TMR (DTMR), Global TMR (GTMR). 
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Block Triple Modular Redundancy: BTMR 


• Can only mask errors 

• 3x the error rate with triplication and no correction/flushing 

• Need feedback or flushing to correct 

• Cannot apply internal correction from voted outputs 

• If blocks are not regularly flushed (e.g., reset), errors 
can accumulate - may not be an effective technique 
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[Figure: Copy 1, Copy 2, and Copy 3 of a complex function with DFFs feed a voting matrix.] 



When BTMR is Beneficial: Examples of Flushable BTMR Designs 

• Shift Registers. 

• Transmission channels: It is typical for transmission 
channels to send and reset after every sent packet. 

• Lock-Step microprocessors that have relaxed 
requirements such that the microprocessors can be 
reset (or power-cycled) every so often. 
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Transmission channel example: 



* Voter 




If The System Is Not Flushable, Then BTMR May Not Provide The Expected Level of Mitigation 

• BTMR can work well as a mitigation 
scheme if the expected MTTF >> expected 
window of correct operation. 

• But... If the expected time to failure for one 
block is less than the required full-liveliness 
availability window, then BTMR 
doesn’t buy you anything. 

• If not thought out well, BTMR can actually 
be a detriment - complexity, power, and 
area, and false sense of performance. 
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Explanation of BTMR Strength and Weakness 
using Classical Reliability Models 

Quantity                                             Expression 
Reliability for 1 block ($R_{block}$)                $e^{-\lambda t}$ 
Reliability for BTMR ($R_{BTMR}$)                    $3e^{-2\lambda t} - 2e^{-3\lambda t}$ 
Mean Time to Failure for 1 block ($MTTF_{block}$)    $1/\lambda$ 
Mean Time to Failure BTMR ($MTTF_{BTMR}$)            $5/(6\lambda) = 0.833/\lambda$ 

[Figure: Reliability: Simplex System versus Block TMR Version. Curves: System 1 ($\lambda$ = 1/40 failure/day), System 2 ($\lambda$ = 1/730 failure/day), BTMR of System 1, BTMR of System 2. x-axis: Days, 0 to 2000.] 

Overall: $MTTF_{BTMR} < MTTF_{Block}$ 

Operating in this time interval will 
provide a slight increase in 
reliability. 

However, it will provide a relatively [...] 
design. 
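
The classical expressions above can be evaluated for the two failure rates shown in the plot legend (1/40 and 1/730 failures per day). The sketch assumes a constant failure rate and no flushing or repair; the helper names are illustrative. Setting $R_{BTMR} = R_{block}$ gives $R_{block} = 0.5$, so BTMR is more reliable than a single block only for $t < \ln 2 / \lambda$.

```python
import math


def r_block(lam, t):
    """Reliability of one block: exp(-lambda * t)."""
    return math.exp(-lam * t)


def r_btmr(lam, t):
    """Reliability of BTMR without flushing: 3R^2 - 2R^3 with R = exp(-lambda * t)."""
    r = r_block(lam, t)
    return 3 * r**2 - 2 * r**3


def mttf_block(lam):
    return 1.0 / lam


def mttf_btmr(lam):
    return 5.0 / (6.0 * lam)


def crossover_days(lam):
    """Time after which BTMR is less reliable than a single block."""
    return math.log(2) / lam


LAM_SYSTEM1 = 1 / 40    # failures per day, from the plot legend
LAM_SYSTEM2 = 1 / 730   # failures per day, from the plot legend

cross1 = crossover_days(LAM_SYSTEM1)   # roughly 27.7 days
cross2 = crossover_days(LAM_SYSTEM2)   # roughly 506 days
```

For System 1 the useful BTMR window is under a month; beyond it the triplicated, unflushed design is less reliable than the simplex block, which is the weakness the slide points out.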
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What Should be Done If Availability 
Needs to be Increased? 



• If the blocks within the BTMR have a relatively high upset 
rate with respect to the availability window, then stronger 
mitigation must be implemented. 

• Bring the voting/correcting inside of the modules... bring 
the voting to the module DFFs. 

The following slides illustrate the various forms of TMR that include voter insertion in the data-path. 


TMR Nomenclature 

DFF: Edge triggered flip-flop; CL: Combinatorial Logic 

TMR               Description                                               Acronym 
Local TMR         DFFs are triplicated                                      LTMR 
Distributed TMR   DFFs and CL-data-paths are triplicated                    DTMR 
Global TMR        DFFs, CL-data-paths and global routes are triplicated     GTMR or XTMR 
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Describing Mitigation Effectiveness Using A Model 

DFF: Edge triggered flip-flop 

CL: Combinatorial Logic 

$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$ 

$P(f_s)_{functionalLogic} = P(f_s)_{DFFSEU \rightarrow SEU} + P(f_s)_{SET \rightarrow SEU}$ 

- $P(f_s)_{DFFSEU \rightarrow SEU}$: Probability that an SEU in a DFF will manifest as an error in the next system clock cycle 

- $P(f_s)_{SET \rightarrow SEU}$: Probability that an SET in a CL gate will manifest as an error in the next system clock cycle 
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Local Triple Modular Redundancy (LTMR) 

• Only DFFs are triplicated. Data-paths are kept singular. 

• LTMR masks upsets from DFFs and corrects DFF upsets if feedback is 
used. 

• Good for devices where DFFs are most 
susceptible and configuration and CL 
susceptibility are insignificant; e.g., 
Microsemi ProASIC3. 

[Figure: LTMR block diagram showing combinatorial logic blocks and DFFs.] 

$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$ 

$P(f_s)_{functionalLogic} = P(f_s)_{DFFSEU \rightarrow SEU} + P(f_s)_{SET \rightarrow SEU}$ 
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Adding LTMR to a Microsemi ProASIC3 Device 

LET: Linear Energy Transfer; 
WSR: windowed shift register 

• Microsemi ProASIC3 - DFFs are the most susceptible (to heavy-ion SEUs) data-path components. 

• Adding LTMR decreases design sensitivity to SEUs. 

[Figure: Non-mitigated and mitigated WSRs with the ProASIC3 (regard the frequency trends). LET = 20.3, NoTMR versus LTMR, checker pattern. x-axis: Frequency (Hz), 0 to 2.5E+08; top y-axis tick: 1.6E-07.] 
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Adding LTMR to a Microsemi ProASIC3 
Device versus RTAXs Embedded LTMR 

• At lower LETs, user-inserted LTMR in the ProASIC3 has an SEU response similar to that of the Microsemi RTAXs series. 

• At higher LETs, clock tree upsets start to dominate and LTMR in the ProASIC3 is not as effective. 

• For most critical applications, these cross-sections will produce acceptable upset rates. 

LET: Linear Energy Transfer; 
WSR: windowed shift register 

[Figure: Non-mitigated and mitigated WSRs with the ProASIC3 (regard the frequency trends). LET = 20.3, NoTMR versus LTMR, checker pattern. x-axis: Frequency (Hz); top y-axis tick: 1.6E-07. Additional curve: embedded LTMR in a DFF cell, RTAXs series.] 
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LTMR Should Not Be Used in An 
SRAM Based FPGA 




[Figure: diagram built from Look Up Tables (LUTs).] 



Distributed Triple Modular Redundancy (DTMR) 

• Triple all data-paths and add voters after DFFs. 

• DTMR masks upsets from configuration + DFFs + CL and corrects 
captured upsets if feedback is used. 



• Good for devices where configuration or DFFs + CL are more 
susceptible than project requirements; e.g., Xilinx and Altera 
commercial FPGAs. 
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Global Triple Modular Redundancy (GTMR) 

• Triple all clocks and data-paths, and add voters after DFFs. 

• GTMR has the same level of protection as DTMR; however, it also 
protects clock domains. 

• Good for devices where configuration or DFFs + CL are more 
susceptible than project requirements; e.g., Xilinx and Altera 
commercial FPGAs. 





[Figure: GTMR block diagram, with the terms of the $P(f_s)_{error}$ model annotated as Low or Lowered.] 





Theoretically, GTMR Is The Strongest 
Mitigation Strategy... BUT... 

• Triplicating a design and its global routes takes up a 
lot of power and area. 

• Generally performed after synthesis by a tool; not 
part of RTL. 

• Skew between clock domains must be minimized such 
that it is less than the feedback of a voter to its 
associated DFF: 

- Does the FPGA contain enough low skew clock 
trees? (each clock + its synchronized reset)x3. 

- Limit skew of clocks coming into the FPGA. 

- Limit skew of clocks from their input pin to their 
clock tree. 

• Difficult to verify. 
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Currently, What Are The Biggest Challenges 
Regarding Mitigation Insertion? 



• Tool availability... Synopsys is now available. 


* Users are not selecting the correct mitigation scheme for their 
target FPGA. 

* Logic partitioning is not being performed when needed. 


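[Table: columns are FPGA Type, LTMR, DTMR, GTMR. Cells are color-coded on the slide using the legend below. One cell in each of the Antifuse+LTMR, Commercial Flash, and Hardened SRAM rows is marked "?????".] 

FPGA Type rows: 

- Antifuse+LTMR: Microsemi RTAX or RTSX family 

- Commercial SRAM: Xilinx and Altera devices 

- Commercial Flash: Microsemi ProASIC family 

- Hardened SRAM: Xilinx V5QV 

Legend: 

- General Recommendation 

- Not Recommended but may be a solution for some situations 

- Will not be a good solution 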
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User versus Embedded Mitigation 

• A subset of user inserted mitigation strategies 
has been presented. 

• None of the strategies are 100% fail-safe. 

• Depending on the project requirements, and the 
target device’s SEU susceptibility, the most 
efficient mitigation strategy should be selected. 

• In most cases, devices with embedded 
mitigation do not require additional (user 
inserted) mitigation. 









Synchronous FSMs and SEUs 

A synchronous FSM is designed to deterministically 
transition through a pattern of defined states. 

A synchronous FSM utilizes DFFs to hold its current 
state, transitions to a next state controlled by a clock 
edge and combinatorial logic, and only accepts 
inputs that have been synchronized to the same 
clock. 

FSM SEUs can occur from: 

- Caught data-path SETs 

- DFF SEUs 

- Clock/Reset SETs 

[Figure: synchronous FSM block with inputs and clock.] 
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5-State FSM Binary Encoding Example 


[Figure, left: 5-State Finite State-Machine. Example of an FSM used to control a peripheral device.] 

[Figure, right: 5-State Finite State-Machine, Binary Encoding. 5-State FSM with each state encoded as binary numbers.] 

An SEU can change current state and cause a catastrophic event 
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How Do We Implement Fail-Safe FSMs? 



• Question: A designer states that all FSMs 
have been implemented as “safe”, what do 
you expect? 

• Correction? Detection? Masking? 

- What does correction mean? 

- All mitigation shall be defined unambiguously 
by the requirements and by the designer. 
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Safe State Machines 

As currently defined by design tools and by some 
designers, the term “safe” state machine is a misnomer. 

Auto transitioning (“safe state-machine”) is a reaction to 
a small subset of incorrect transitions (unmapped states). 
It does not correct or mask (protect) against incorrect 
transitioning. 

What happens if an SEU causes a transition from “001” to “101”? 

5-State Finite State-Machine, Binary Encoding: 

State   Mapped? 
000     Yes 
001     Yes 
010     Yes 
011     Yes 
100     Yes 
101     No 
110     No 
111     No 
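
The limitation can be seen in a small behavioral sketch of auto-transitioning. The state cycle 000 → 001 → 010 → 011 → 100 → 000, the choice of 000 as the "safe" state, and the name `safe_fsm_step` are assumptions for illustration; the slides do not define the transitions.

```python
MAPPED_STATES = {0b000, 0b001, 0b010, 0b011, 0b100}
SAFE_STATE = 0b000  # assumed reset/"safe" state


def safe_fsm_step(state):
    """One clock of a "safe" FSM: returns (next_state, auto_transitioned)."""
    if state not in MAPPED_STATES:
        return SAFE_STATE, True           # reaction to an unmapped state
    return (state + 1) % 5, False         # assumed nominal sequence


current = 0b001

# SEU flips the top bit: 001 becomes 101, which is unmapped.
unmapped_result = safe_fsm_step(current ^ 0b100)

# SEU flips the middle bit: 001 becomes 011, which is mapped.
mapped_result = safe_fsm_step(current ^ 0b010)
```

The first upset is noticed and forces a jump to the assumed reset state, which is a reaction and not a correction. The second upset lands in a mapped state, so the FSM carries on from the wrong place with no flag raised.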
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Safe State Machines: What happens if an 
SEU causes a transition from “001” to “101”? 

• As currently implemented, a “safe” state machine will 
automatically transition to a reset (or “safe”) state. 

• Problem: this could be detrimental to your system 

5-State Finite State-Machine, Binary Encoding: 

State   Mapped? 
000     Yes 
001     Yes 
010     Yes 
011     Yes 
100     Yes 
101     No 
110     No 
111     No 
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Problems with Current “Safe” FSM Definition 

• Sounds safer than it really is. 

• Does not do anything for incorrect transitions into 
mapped states. 

• Does not correct the state: 

- Something that is supposed to 
be on will abruptly shut off. 

- Other FSMs or control logic 
can become unsynchronized 
with the bad FSM; with or 
without the automated jump to 
a “safe” state. 
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5-State Finite State-Machine 


Can Auto-transitioning Work for Your Mission? 

• Auto-transitioning can work if 
incorrect sequencing of your FSM 
will not cause system failure; e.g., 
mathematical logic control. 

• Auto-transitioning can be 
acceptable if it is used in 
conjunction with a detection flag. 

• The detection flag must propagate 
to all necessary logic. 

• But remember, there is no 
protection or detection with 
auto-transitioning when incorrectly 
transitioning to a mapped state. 

• Auto-transitioning + detection is available with computer 
aided design (CAD) tools. 
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Implementing Corrective Logic for FSMs 

• FPGAs with hardened configuration: 

- LTMR: Triplicate each DFF and use a majority voter. 

• The triplication + voter is treated as one DFF 

• Encoding doesn’t change 

• Resultant FSM has 3 times the number of DFFs 
than the original encoding scheme. 

• Combinatorial logic (not including the voters) 
does not change 

- Hamming Code-3: requires a new encoding scheme. 

• FPGAs with commercial SRAM configuration: 
DTMR is suggested. 

There are computer aided design tools (CAD) that can 
assist in adding all of the above mitigation strategies. 
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FSM Fault Tolerance: 


5-State Conversion to a Hamming Code-3 FSM 

[Figure, left: 5-State Binary Finite State-Machine Converted to Hamming-3 FSM. Hamming Code-3 FSM diagram for a 5 base-state FSM: would need 5*7=35 FSM states to be represented... 6 DFFs.] 

[Figure, right: State 0 (State IDLE) and its Hamming-3 companion states (state label shown: 000111). A closer look at a base-state (state 0) and its companion-states.] 
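
The 5*7 = 35 count can be reproduced with a sketch of a distance-3 state encoding on 6 DFFs. The slides do not give the actual codewords, so the parity construction below (three state bits plus three parity bits) is an assumption chosen only because it guarantees a minimum Hamming distance of 3; `encode`, `hamming`, and `correct` are hypothetical names.

```python
from itertools import combinations


def encode(state):
    """Map a 3-bit state number to a 6-bit codeword (assumed encoding)."""
    d0, d1, d2 = (state >> 2) & 1, (state >> 1) & 1, state & 1
    p0, p1, p2 = d1 ^ d2, d0 ^ d2, d0 ^ d1
    return (state << 3) | (p0 << 2) | (p1 << 1) | p2


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


CODEWORDS = {s: encode(s) for s in range(5)}   # 5 base states


def correct(word):
    """Return the base state whose codeword is within one bit flip, else None."""
    for state, cw in CODEWORDS.items():
        if hamming(word, cw) <= 1:
            return state
    return None


min_distance = min(hamming(a, b)
                   for a, b in combinations(CODEWORDS.values(), 2))

# Each base state plus its 6 single-bit-flip companions.
companions = {cw ^ flip
              for cw in CODEWORDS.values()
              for flip in [0] + [1 << bit for bit in range(6)]}

all_single_flips_corrected = all(
    correct(cw ^ (1 << bit)) == state
    for state, cw in CODEWORDS.items() for bit in range(6))
```

Since the codewords are at least 3 bits apart, the 7-word groups around each base state do not overlap, giving 35 distinct FSM states, and any single-bit SEU maps back to its base state.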
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ProASIC3 Heavy-Ion FSM SEU Testing 






Scale is Log-Linear 
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Some Thoughts 





Concerns and Challenges of Today 
and Tomorrow for Mitigation Insertion 

* User insertion of mitigation strategies in most FPGA devices has 
proven to be a challenging task because of reliability, performance, 
area, and power constraints. 

- Difficult to synchronize across triplicated systems. 

- Mitigation insertion slows down the system. 

- Can’t fit a triplicated version of a design into one device. 

- Power and thermal hot-spots are increased. 

* The newer devices have a significant increase in gate count and 
lower power. This helps to accommodate area and power 
constraints while triplicating a design. However, this increases the 
challenge of module synchronization. 

* Embedded mitigation has helped in the design process. However, it 
is proving to be an ever-increasing challenge for manufacturers. 

- We (users) want embedded systems: cheaper, faster, and less power 
hungry. 

- However, heritage has proven that for critical applications, embedded 
systems have provided excellent performance and reliability. 
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Summary 

* For critical applications, mitigation may be required. 

* Determine the correct mitigation scheme for your mission while 
incorporating given requirements: 

- Understand the susceptibility of the target FPGA and how it 
responds to other devices. 

- Investigate if the selected mitigation strategy is compatible with the 
target FPGA. 

- Calculate the reliability of the mitigation strategy to determine if 
the final system will satisfy requirements. 

- Ask the right questions regarding functional expectation, 
mitigation, requirement satisfaction, and verification of 
expectations. 

* Although it is desirable from a user’s perspective to have embedded 
mitigation, cost seems to be driving the market towards unmitigated 
commercial FPGA devices. Hence, it will be necessary for users to 
familiarize themselves with optimal mitigation insertion and usage. 
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